Sampling & Resampling
Sampling
Four common sampling methods in applied machine learning (a minimal code sketch follows the list):
Simple Random Sampling
- Samples are drawn with a uniform probability from the domain.
Systematic Sampling
- Samples are drawn using a pre-specified pattern, such as at intervals.
Stratified Sampling
- Samples are drawn within pre-specified categories (i.e. strata).
Cluster Sampling
- Samples are drawn from naturally occurring groups (clusters), typically by selecting whole clusters at random.
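Below is a minimal sketch of the four methods using NumPy. The population, the strata/cluster labels, and the sample size are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)            # hypothetical population of 1,000 units
labels = rng.integers(0, 3, size=1000)  # hypothetical strata / cluster labels (0, 1, 2)
n = 50                                  # desired sample size

# Simple random sampling: every unit has equal probability, drawn without replacement.
simple = rng.choice(population, size=n, replace=False)

# Systematic sampling: a random start, then every k-th unit.
k = len(population) // n
start = rng.integers(0, k)
systematic = population[start::k][:n]

# Stratified sampling: draw the same number of units from each stratum.
stratified = np.concatenate([
    rng.choice(population[labels == s], size=n // 3, replace=False)
    for s in np.unique(labels)
])

# Cluster sampling: pick whole clusters, keep every unit in the chosen clusters.
chosen_clusters = rng.choice(np.unique(labels), size=1, replace=False)
cluster = population[np.isin(labels, chosen_clusters)]
```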
Sampling error
Two main types of error are selection bias and sampling error.
Selection Bias
- Caused when the method of drawing observations skews the sample in some way.
Sampling Error
- Caused by the random nature of drawing observations, which skews the sample in some way.
Resampling
The problem of sampling is that we only have a single estimate of the population parameter, with little idea of the variability or uncertainty in the estimate.
One way to address this is by estimating the population parameter multiple times from our data sample. This is called resampling.
Several resampling methods include permutation tests, the bootstrap, the jackknife, and cross-validation.
(Figure: an overview of the resampling methods described below.)
Permutation
- also called exact tests, randomisation tests, or re-randomisation tests
- Exchanging labels on data points when performing significance tests
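A minimal sketch of a two-sample permutation test for a difference in means; the group data, number of permutations, and function name are illustrative assumptions.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, rng=None):
    """Two-sample permutation test for a difference in means (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # exchange group labels by shuffling the pooled data
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_permutations  # two-sided p-value estimate

# Hypothetical example data
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=30)
print(permutation_test(a, b, rng=rng))
```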
Bootstrap
- Estimating the precision / accuracy of sample statistics, or validating models, by drawing randomly with replacement from a set of data points.
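A minimal sketch of a percentile bootstrap confidence interval; the sample data, number of resamples, and function name are illustrative assumptions.

```python
import numpy as np

def bootstrap_ci(x, stat=np.mean, n_resamples=10_000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for a sample statistic (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # Each resample draws len(x) points from x with replacement.
    stats = np.array([
        stat(rng.choice(x, size=len(x), replace=True))
        for _ in range(n_resamples)
    ])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(2)
x = rng.normal(10.0, 2.0, size=100)  # hypothetical sample
print(bootstrap_ci(x, rng=rng))      # approximate 95% CI for the mean
```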
Jackknife
- Estimating the precision / accuracy of sample statistics by recomputing the statistic on subsets of the data, typically leaving out one observation at a time.
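A minimal sketch of the leave-one-out jackknife for a statistic and its standard error; the sample data and function name are illustrative assumptions.

```python
import numpy as np

def jackknife(x, stat=np.mean):
    """Leave-one-out jackknife estimate of a statistic and its standard error (illustrative sketch)."""
    n = len(x)
    # Recompute the statistic n times, each time leaving out one observation.
    leave_one_out = np.array([stat(np.delete(x, i)) for i in range(n)])
    estimate = leave_one_out.mean()
    se = np.sqrt((n - 1) / n * np.sum((leave_one_out - estimate) ** 2))
    return estimate, se

x = np.random.default_rng(3).normal(5.0, 1.0, size=50)  # hypothetical sample
print(jackknife(x))
```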
Cross validation
- Validating models by repeatedly splitting the data into training and validation subsets.
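A minimal sketch of plain k-fold cross-validation; the data, the `fit` and `score` callables, and the function name are illustrative assumptions.

```python
import numpy as np

def k_fold_scores(X, y, fit, score, k=5, rng=None):
    """Average validation score over k random folds (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])          # train on k-1 folds
        scores.append(score(model, X[val_idx], y[val_idx]))  # validate on the held-out fold
    return np.mean(scores)

# Hypothetical example: constant-mean "model" scored by negative MSE.
rng = np.random.default_rng(4)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
fit = lambda X_tr, y_tr: y_tr.mean()
score = lambda m, X_val, y_val: -np.mean((y_val - m) ** 2)
print(k_fold_scores(X, y, fit, score, rng=rng))
```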
Comparison between permutation and the bootstrap:
The difference is that the bootstrap samples with replacement, whereas permutation samples without replacement.
The permutation test is best for testing hypotheses and bootstrapping is best for estimating confidence intervals.